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Abstract 

Recognizing the successes of treed Gaussian process (TGP) models as an inter- 
pretable and thrifty model for nonparametric regression, we seek to extend the model 
to classification. Both treed models and Gaussian processes (GPs) have, separately, en- 
joyed great success in application to classification problems. An example of the former 
is Bayesian CART. In the latter, real-valued GP output may be utilized for classification 
via latent variables, which provide classification rules by means of a softmax function. 
We formulate a Bayesian model averaging scheme to combine these two models and 
describe a Monte Carlo method for sampling from the full posterior distribution with 
joint proposals for the tree topology and the GP parameters corresponding to latent 
variables at the leaves. We concentrate on efficient sampling of the latent variables, 
which is important to obtain good mixing in the expanded parameter space. The tree 
structure is particularly helpful for this task and also for developing an efficient scheme 
for handling categorical predictors, which commonly arise in classification problems. 
Our proposed classification TGP (CTGP) methodology is illustrated on a collection of 
synthetic and real data sets. We assess performance relative to existing methods and 
thereby show how CTGP is highly flexible, offers tractable inference, produces rules 
that are easy to interpret, and performs well out of sample. 

keywords: treed models, Gaussian process, Bayesian model averaging, latent variable 



1 Introduction 



A Gaussian process (GP) (e.g., 



Rasmussen and Williams 



20061 ) is a popular nonparamet- 



ric model for regression and classification that specifies a prior over functions. For ease of 
computation, typical priors often confine the functions to stationarity. While stationarity 
is a reasonable assumption for many data sets, still many more exhibit only local station- 
arity. In the latter case, a stationary model is inadequate, but a fully nonstationary model 
is undesirable as well. Inference can be difficult due to a nonstationary model's complex- 
i ty, much of which is un necessary to obtain a good fit. A treed Gaussian process (TGP) 



(IGramacy and Lee 



20081 ). in contrast, can take advantage of local trends more efficiently. 
It defines a treed partitioning process on the predictor space and fits distinct, but hierar- 
chically related, stationary GPs to separate regions at the leaves. The treed form of the 
partition makes the model particularly interpretable. At the same time, the partitioning 
results in smaller matrices for inversion than would be required under a standard GP model 
and thereby provides a nonstationary model that actually facilitates faster inference. 

Recognizing the successes of TGP for regression, we seek to extend the model to classifi- 
cation. Separately, both treed models and GPs have already been been successfully applied 
to classification. Bayesian methods in the first case outline a tree prior and series of propos- 



als for tree manipulati on in a Monte Carlo sampling framewor 



( jChipman et al. 



leading to B a yesian CART 



19981) — extending the classical version due to 



Breiman et al. 



( 119841 ). In the 



second c ase lies a ra ethod by which real- valued GP output may be utilized in a classification 



context f Neal 



19981 1: for some number of classes, the real outputs of a commensurate number 
of GPs (M — 1 for M classes) are taken as priors for latent variables that yield probabilities 
via a softmax function. This leads to two ways to combine trees with GPs for classification. 
The data may be partitioned once by the tree process, and then we may associate M — 1 GPs 
to each region of the partition, or instead, M — 1 separate full TGPs may be instantiated. 
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We propose taking the latter route in the interests of faster mixing. We describe a Monte 
Carlo sampling scheme to summarize the posterior predictive distribution. The algorithm 
traverses the full space of classification TGPs using joint proposals for tree-topology modifi- 
cations and the GP parameters at the leaves of the tree. We explore schemes for efficiently 
sampling the latent variables, which is important to obtain good mixing in the (significantly) 
expanded parameter space compared to the regression case. 

The remainder of the paper is outlined as follows. We shall first review the GP model for 
regression and classification in Section O Section [3] begins with a review of the extension of 
GPs to nonstationary regression models by treed partitioning, covering the basics of inference 
and prediction with particular focus on the Monte Carlo methods used to integrate over the 
tree structure. We then discuss how categorical inputs may be dealt with efficiently in this 
framework. This discussion represents a new contribution to the regression literature and, 
we shall argue, is particularly relevant to our classification extensions; it is the categorical 
inputs that can most clearly benefit from the interpretation and speed features offered by 
treed partitioning. We are then able to describe the TCP model for classification in detail 
in Section 13.31 In Section H] we illustrate our proposed methodology on a collection of 
synthetic and real data sets. We assess performance relative to existing methods and thereby 
demonstrate that this nonstationary classification model is highly fiexible, offers tractable 
inference, produces rules that are easy to interpret, and (most importantly) performs well 
out of sample. Finally, in Section |5] we conclude with a short discussion. 



2 Gaussian processes for regression and classification 



Gaussian process (GP) models are highly flexible and nonlinear mode 
applied in regression and class iflcation contexts for over a decade (e.g.. 



s that have been 



Neal 



1997 



1998 



Rasmussen and Williams 



20061 ). For real- valued P-dimensional inputs, a GP is formally a 
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prior on the space of functions Z : 



such that the fun ction values at any finite set of 



input points x have a joint Gaussian distribution ( IStein 



19991 ) ■ A particular GP is defined by 



its mean function and correlation function. The mean function = E(Z(x)) is often con- 
stant or linear in the explanatory variable coordinates: fi{x) = P'^f{x), where f{x) = [l,x]. 
The correlation function is defined as K{x,x') = a^^E {[Z(x) — fi{x)][Z(x') — ^{x')]^). 

We further assume that the correlation function can be decomposed into two components: 
an underlying strict correlation function K* and a noise term of constant and strictly positive 
size g that is independently and identically distributed at the predictor points. 



Here, 6ij is the Kronecker delta, and we call g the nugget. In the model, the nugget term 
represents a source of measurement error. Writing the GP as Z{x) = fi{x) + w{x) + t]{x), 
we have fi{x) as the fixed mean function, w{x) as a zero- mean Gaussian random variable 
with correlation K*{x), and ri{x) as an i.i.d. Gaussian noise process. Computationally, the 
nugget term helps ensure that K remains nonsingular, allowing the frequent inversion of 
K without concern for numerical instability that might otherwise plague efficient sampling 
from the Bayesian posterior as necessary for the analysis that follows. 

Due to its simplicity, a popular choice for K*{x, x') is the squared exponential correlation 

K*{x, x') = exp 

where the strictly positive parameter d describes the range (or length-scale) of the process. 
I.e., d governs the rate of decay of correlation as the distance ||x — x'\ \ between locations x 
and x' increases. The underlying (Gaussian) process that results with this choice of K is 
both stationary and isotropic. It is easily extended to incorporate a vector-valued parameter 
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so that correlation may decay at rates that differ depending on direction: 



K*{x, x') = exp 



p=i 



The resulting process is still stationary, but it is no longer i s otrop ic. Furthe r discu ssion of 



correlation structures for GPs can be found in 



structures include the popular Matern class (IMatern 



Abrahamsen 


(1997 


) or 


Stein 


(1999) 



19861 ). which allows the smoothness 



of the process to be parameterized. The choice of correlation function is entirely up to the 
practitioner — the methods described herein apply generally for any K{-,-). 

The GP model, described above fo r the proble m of regression, can be extended to classifi- 



cation by introducing latent variables (INeal 



19971 ). In this case, the data consist of predictors 



X and classes C. Suppose there are data points and M classes. We introduce M sets of 
latent variables {Zm}m=i, one for each class. For a particular class m, the generative model 
is a GP as before: ~ A/Ar(;Um(X), ^'^(X, X)). The class probabilities are now obtained 
from the latent variables via a softmax function 



p{C{x) 



m] 



exp{-Zm{x)) 

Em'=iexp(-^m'(^)) 



The generative model then specifies that the class at a point is drawn from a single-trial 
multinomial distribution with these probabilities. In practice, only M — 1 GPs are required 
since we may fix the latent variables of the final class to zero without loss of generality. 
In this model, it remains to place priors, perhaps of a hierarchical nature, on the parame- 



ters of each G 
also remains. 



The chal 



enge of efficiently sampling from the posterior of the latent variables 



Neall (119971 ) suggests sampling the hyperparameters with a Gaussian random 



walk or by the hybrid Monte Carlo method. Our choice of hierarchica l model, in contrast 



allows almost entirely Gibbs sampling moves (IGramacy and Lee 



20081 ). Neal suggests sam- 
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pling the latent variables (individually) with adaptive rejection sampling (IGilks and Wild 



19921 ). We prefer to use Metropolis-Hastings (MH) draws with a carefully chosen proposal 
distribution and consider block-sampling the latent variables (see Section r3.3.3p . 

The GP model, for both regression and classification, features some notable drawbacks. 
First, as mentioned above, the popular correlation functions are typically stationary whereas 
we may be interested in data which are at odds with the stationarity assumption. Second, the 
typical correlation functions are clumsy for binary or categorical inputs. Finally, the frequent 
inversions of a correlation matrix, an 0{N^) operation, are necessary for GP modeling, which 
makes these analyses computationally expensive. 



3 Treed Gaussian processes 

Partitioning the predictor space addresses all three of the potential drawbacks mentioned 
above and offers a natural blocking strategy (suggested by the data) for sampling the latent 
variables as required in the classification model. In particular, after partitioning the predictor 
space, we may fit different, independent models to the data in each partition. Doing so can 
account for nonstationarity across the full predictor space. Partitions can occur on categorical 
inputs; separating these inputs from the GP predictors allows them to be treated in classical 
CART fashion. Finally, partitions result in smaller local covariance matrices, which are more 
quickly inverted. 



3.1 TGP for regression 



CAR T flBreiman et al. 



1998 



19841 ) and BCART (for Bayesian CART) fIChipman et al 
20021 ) for regression essentially use the same partitioning scheme as TGP but with simpler 
models fit to each partition described by the tree. More precisely, each branch of the tree — 
in any of these models — divides the predictor space in two. Consider predictors x G M^; 
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for some split dimension p G {1, . . . , P} and split value v, points with Xp < v are assigned 
to the left branch, and points with Xp > v are assigned to the right branch. Partitioning 
is recursive and may occur on any input dimension p, so arbitrary axis-aligned regions 
in the predictor space may be defined. Conditional on a treed partition, models are fit 
in each of the leaf regions. In CART the underlying models are "constant" in that only 
the mean and standard deviation of the real-valued outputs are inferred. The underlying 
CART tree is "grown" according to one of many decision theoretic heuristics and may be 



"pruned" using c ross-v alidation r nethods. In BCART, t hese models may be either constant 

2) 



(iChipman et al 



19981 ) or linear fIChipman et al. 



20021 ) and, by contrast with CART, the 



partitioning structure is determined by Monte Carlo inference on the joint posterior of the 
tree and the models used at the leaves. In regression TCP (hereafter RTGP), the leaf models 
are CPs, but otherwise the setup is identical to BCART. Note that the constant and linear 
models are special cases of the CP model. Thus RTGPs encompass BCART for regression 
and may proceed according to a nearly identical Monte Carlo method, described shortly. 
For implementation and inference in the RTGP model class, we take ad yantage of the 



open s ource tgp package (IGramacy and Taddy 



20081 ) for R available on GRAN 



R Development Core Team 



(120081 ). Details of th e computi n g tech niques, algorithms, default parameters, and illustra- 



tions are provided by 



Gramacvl (120071 ). 



3.1.1 Hierarchical model 

The hierarchical mo del for the RTGP be gins with the tree prior. We define this prior 
recursively following 



Chipman et al. 



(119981 ). To assemble tree T, start with a single (root) 
node. Form a queue with this root node. Recursively, for each node i] in the queue, decide to 
split f] with some probability Pspiit(7~, rj). If 77 splits, first choose a splitting rule for 77 according 
to the distribution Pruic(7~, 77), and then put the new children of rj in the queue. It remains 
to define the distributions Pspiit('7~, rj) and Pruie(7~, r/). The choice Pspiit(7~, r/) = a(l + D^)~^ , 
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with a G (0, 1) and /3 > 0, includes a base probability of splitting a and a penalty factor 
(1 + which is exponential in the node depth in T. A simple distribution Pruie over 

split-points is first uniform in the coordinates of the explanatory variable space and then 
uniform over splitting values that result in distinct data sets at the child nodes. When non- 
informative priors are used at the leaves of the tree it is sensible to have the tree prior enforce 
a minimum data requirement for each region of the partition(s). We take the tgp defaults 
in this respect, which automatically determine an appropriate minimum as a function of the 
predictor dimension P. 

Now let r G {1, . . . , K\ index the R non-overlapping regions partitioned by the tree T 
formed from the above procedure. In the regression problem, each region contains A^(r) 
points of data: {Xr^Zr\ is defined to be an extension to the predictor matrix with an 
intercept term: = (l,^^); thus, Fr has P +1 columns. A "constant mean" may be 
obtained with = 1; in this case, P = in Eq. ([2]) below. The generative model for the 
GP in region r incorporates the multivariate normal (A/"), inverse-gamma (IG), and Wishart 
(W) distributions as follows. 

Zr\^,a^,,Kr ^ ArNir){Fr^,a'rKr) /3, | a^, T^, 1^, /3o ~ A^p+i (/3o, a^T^PF ) 

~ lG(a J2, qj2) /^o ~ Afp+iifi, B) (2) 

~ IGK/2,g./2) W'^ ~ W{{pV)-\p) 

The hyperparameters fi, B,V, p,a^,qfj,ar,qT are constant in the model, for which we take 
the tgp defaults. 

3.1.2 Estimation 

Our aim in this section is to approximate the joint distribution of the tree structure T, the R 
sets of GP parameters {Or}re{i,...,R} in each region depicted by T, and GP hyperparameters 
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6q (those variables in Eq. ([2]) that are not treated as constant but also not indexed by r). 
To do so, we sample from this distribution using Markov Chain Monte Carlo (MCMC). 
We sequentially draw ^o|rest, 6'r|rest for each r = 1,...,R, and T|rest. All parameters 
{6r, r = 1, . . . ,R) and hyperparameters of the CPs can be sampled with Gibbs steps, with 



the exception of t h e coy ariance function parameters {dr,gr}- Expression s 



Gramacy and Led (120081 ). and full derivations are provided by 



are provided by 



Gramacyl (120051 ). which are 



by now quite standard in the literature so they are not reproduced here. 

Monte Carlo integration over tree space, conditional on the CP parameters Or, r = 
1, . . . , i? is, by contrast, more involved. Difficulties arise when we randomly draw a new tree 
structure from the distribution T|rest because it is possible that the new tree may have more 
leaf nodes (or fewer) than before. Changing the number of leaf nodes changes the dimension 
of 6' = {6i, . . . ,6fi), so simple MH draws are not sufficient for T. Instead, reversible jump 
Markov Chain Monte Carlo (RJ-MCMC) a llows a principled transition between models of 



different sizes (IRichardson and Greenl . 



19971). 



However, by choosing the CP parameters at the leaf nodes of the proposal tree carefully, 
the distinction between MH and RJ-MCMC can be effectively ignored. In the case where 
the number of leaf nodes does not change between the current tree state and the proposal 
tree, maintaining the same collection of {6r} allows us to ignore the RJ-MCMC Jacobian 
factor. In the case where the number of leaf nodes increases — and the increase will always 
be by exactly one leaf node — drawing the CP parameters from their priors similarly sets 
the Jacobian factor in t he RJ-MCMC acc e ptanc e ratio to unity; this leaves just the usual 



MH acceptance ratio. 



Gramacy and Led (120081 ) describe proposals that facilitate moves 



throughout the space of possible trees and how these moves are made reversible. The moves 
are called grow, prune, change, and swap, where the final move has a special sub-case rotate 
to increase the resulting MC MC acceptance ratio. This list of four moves is largely the 



same as in the BCART model (jChipman et al. 



19981 ). with the exception of rotate. During 
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MCMC, the moves are chosen uniformly at random in each iteration. 



3.1.3 Prediction 

From the hierarchical model in Section [21 we can solve for the predictive distribution of the 
outputs Z. Conditional on the covariance structure, standard multivariate normal theory 
gives that the predicted value of Z(x G region r) is normally distributed with 

E(Z(a;)| data,x e region r) = f {x)~^r + Ki^Y K^{Zr - FrPr), 
Var(Z(x)| data, X G region r) = a'^[K,{x,x) — qj {x)C^^qr{x)], (3) 

where C;^ = {Kr + T^FrWFj)-^ = kr{x) + T^FrWf{x) 

K{x,y) = Kr{x,y) + T^f{x)Wf{y) 

Here, we have f~^{x) = {l,x'^) under the model with a linear mean, or /"""(x) = 1 under 



the constant mean. Further, kr{x) is an n^-long vector with k j.^j(x) = Kr(x,Xj 



Gramacy and Led (120081 ). 



for all 



Xj G region r, and /3r = E(/3|Fr, Z^, rest) is given in closed form by 

Conditional on a particular tree T, the expressions in Eq. (JSj) betray that the predictive 
surface is discontinuous across the partition boundaries of T. However, in the aggregate of 
samples collected from the joint posterior distribution of (T,6), the mean tends to smooth 
out near likely partition boundaries as the tree operations grow, prune, change, and swap 
integrate over trees and CPs with large posterior probability. Uncertainty in the posterior for 
T translates into higher posterior predictive uncertainty near region boundaries. When the 
data actually indicate a non-smooth process, the treed CP retains the flexibility necessary 
to model discontinuities. When the data are consisten t with a continuou s proc ess, the treed 



CP fits are almost indistinguishable from continuous (j Gramacy and Lee 



20081). 
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3.2 Categorical inputs 

Classical treed methods, such as CART, can cope quite naturally with categorical, binary, 
and ordinal inputs. For example, categorical inputs can be encoded in binary, and splits 
can be proposed with rules such as Xj, < 1. Once a split is made on a binary input, 
no further process is needed, marginally, in that dimension. Ordinal inputs can also be 
coded in binary, and thus treated as categorical, or treated as real-valued and handled in 
a default way. CP regression, however, handles such non-real-valued inputs less naturally, 
unless (perhaps) a custom and non-standard form of the covariance function is used (e.g.. 



Qian et al. 



20071 ). When inputs are scaled to lie in [0, 1], binary- valued inputs Xp are always 
a constant distance apart — at the largest possible distance in the range. When using a 
separable correlation function, parameterized by length-scale parameter dp, the likelihood 
will increase as dp gets large if the output does not vary with Xp and will tend to zero if 
it does (in order to best approximate a step function correlation). Moreover, the binary 
encoding of even a few categorical variables (with several levels) would cause a proliferation 
of binary inputs which would each require a unique range parameter. Mixing in the high 
dimensional parameter space defining the CP correlation structure would consequently be 
poor. Clearly, this functionality is more parsimoniously achieved by partitioning, e.g., using 
a tree. However, trees with fancy regression models at the leaves pose other problems, as 
discussed below. 

Rather than manipulate the CP correlation to handle categorical inputs, the tree presents 
a more natural mechanism for such binary indicators. That is, they can be included as 
candidates for treed partitioning but ignored when it comes to fitting the models at the 
leaves of the tree. They must be excluded from the CP model at the leaves since, if ever a 
Boolean indicator is partitioned upon, the design matrix (for the CP or LM) would contain a 
column of zeros or ones and therefore would not be of full rank. The benefits of removing the 
Booleans from the CP model(s) go beyond producing full-rank design matrices at the leaves of 
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the tree. Loosely speaking, removing the Boolean indicators from the GP part of the treed 
GP gives a more parsimonious model. The tree is able to capture all of the dependence 
in the response as a function of the indicator input, and the GP is the appropriate non- 
linear model for accounting for the remaining relationship between the real-valued inputs 
and outputs. Further advantages to this approach include speed (a partitioned model gives 
smaller covariance matrices to invert) and improved mixing in the Markov chain when a 
separable covariance function is used since the size of the parameter space defining the 
correlation structure would remain manageable. Note that using a non-separable covariance 
function in the presence of indicators would result in a poor fit. Good range [d) settings 
for the indicators would not necessarily coincide with good range settings for the real- valued 
inputs. Finally, the treed model allows the practitioner to immediately ascertain whether 
the response is sensitive to a particular categorical input by tallying the proportion of time 
the Markov chain visited trees with splits on the corr esponding binary in dicator. A much 



Saltelli et al. 



20081 ) would otherwise 



more involved Monte Carlo technique (e.g., following 
be required in the absence of the tree. 

If it is known that, conditional on having a treed process for the binary inputs (encoding 
categories), the relationship between the remaining real- valued inputs and the response is 
stationary, then we can improve mixing in the Markov chain further by ignoring the real- 
valued inputs when proposing tree operations. In Section 14.11 we shall illustrate the benefit 
of the treed approach to categorical inputs in the regression context. 

There is a possible drawback to allowing (only) the tree process to govern the relationship 
between the binary predictors and the response; any non-trivial treed partition would force 
the GP part of the model to relate the real-valued inputs and the response separately in 
distinct regions. By contrast, one global GP — with some mechanism for directly handling 
binary inputs — aggregates information over the entire predictor space. We can imagine a 
scenario where the correlation structure for the real-valued inputs is globally stationary, but 



12 



the response still depends (in part) upon the setting of a categorical input. Then partitioning 
upon that categorical input could weaken the GP's ability to learn the underlyin g stationary 



proces s governing the real-valued inputs. In this case, the approach described by 



Qian et al. 



(120071 ). which explicitly accounts for correlation in the response across different values of the 
categorical predictors, may be preferred. However, we shall argue in Section 13.31 that in the 
classification context this possibility is less of a worry because the GP part of the model is 
only a prior for the latent Z-values, and therefore its influence on the posterior is rather weak 
compared to the class labels C. Such small influences take a back seat to the (by comparison) 
enormous speed and interpretability benefits that are offered by treed partitioning on binary 
inp uts as illust r ated empirically in Sections 14.21 and 14.31 These benefits are not enjoyed by 



the 



Qian et al. 



( 120071 ) approach. 



3.3 TGP for classification 

Recall that in order to adapt the GP for classification (Section [2]), we introduced M — 1 sets 
of latent variables Z„i so that each predictor X corresponds to M — 1 latent variable values. 
The class C {X) then has a single-draw multinomial distribution with probabilities obtained 
via the softmax function ([T]) applied to these latent variables. 

3.3.1 Possible model formulations 

In the TGP case, however, it is not immediately clear how these latent variables fit into the 
tree structure. We can imagine at least two possibilities. First, recall that in the RTGP 
model, the tree partitions the predictor space into regions. In each region, we fit a GP. One 
latent variable extension to this model would be to fit a CGP at each leaf of the tree. Instead 
of a single GP, each leaf CGP consists of M — 1 GPs. The output of these GPs would be the 
M — 1 latent variable sets in that particular region. Together, the output over all regions 
would form the entire set of latent variables. Since this model features just one tree, we call 
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it an OTGP. 

Alternatively, we might not choose to generate a CGP at each leaf of a single tree. Rather, 
we could recall that the CGP amalgamates latent variables from M — 1 real- valued models. 
Instead of the M — 1 GPs of the CGP, then, we could generate M — 1 independent, full TCP 
models. The output of the m}^ TCP would be the full set of latent variables Zm for the m^^ 
class. Since this model features multiple, independent trees, we call it an MTGP. Note that 
in this latter case, the trees need not all have the same splits. We can imagine M — 1 perfect 
copies of X indexed by class. Then {Xm, Zm) may be partitioned differently for each m. In 
the OTGP case, on the other hand, [X^., Z^) is partitioned into exactly the same regions 
across all m. 

We choose to focus on the MTGP in the following analysis as we expect it to exhibit better 
mixing than the OTGP when the number of classes is greater than two. For a particular 
choice of prior and hierarchical model, the OTGP may be thought of as an MTGP where 
the trees are all constrained to split on the same variable-value pairs. The MTGP may thus 
more easily model natural splits in a set of data since each tree can be different. Imagine 
a data set where class 1 occurs predominantly in region 1, class 2 occurs predominantly in 
region 2, and class 3 in region 3. As we will see in Section |42| a single split in each of the 
two MTGP trees may be sufficient to capture these divisions. For OTGPs, however, the 
overarching tree in the same case would require two splits to represent the partition. But a 
greater number of splits is less likely under the given tree prior (Section I3.1.ip . 

Moreover, the final splits in the MTGP are more interpretable since we immediately see 
which class label was relevant for a given split. In the OTGP, a split may be primarily 
relevant to a particular class label or set of class labels, but it will still be applied across 
all class labels. Along similar lines, when a grow or change move is proposed in the tree, 
evidence from all of the GPs of the OTGP accumulates for or against the move. Thus it is 
less likely that the acceptance probability will be high enough for such a move to take place 
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in the OTGP. But for the MTGP, only the latent variables of class m directly influence the 
proposal probability of grows or changes in the mth tree. So it is easier for the MTGP trees 
to grow and thereby take advantage of the tree structure in the model. 

Finally, as a practical consideration, the MTGP is more directly implemented as an 
extension of the RTGP. The combination of tree and latent variable sets for each class can 
be essentially the same — depending on hierarchical model choice in the classification case as 
the existing RTGP formulation with the latent variables as the outputs. Thus, the MTGP 
model may be coded by making minor modifications to the the tgp package. Since we 
henceforth use only the MTGP, we will refer to it as a CTGP (TGP for classification) in 
analogy to the acronym RTGP. 



3.3.2 Hierarchical model 



~ Multi(l,p(")) 
pt^ oc exp(-Z„(xW)) 


CTGP 


Z \B a"^ K Mak ^{F B a"^ K ) '^^HM-i),i:Rm 

^m,r\Hm,r) '~'m,r^ ^^m,r ■' ^ W (m,r) V-"- m,rh'm,r) '-^ m,r ni,r J 
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) 


f^m,rWm,rJ ^m,r^ f^mfi ~ -^P+l {f^mfl, ^m,r'^m,r^rn) 

a^^, ~ Xg (a J2, qj2) (3^,0 ~ A/'p+il/i, B) W^^ ~ W ((pV 
^m,r ~ {ar/2, {Region^ ,.}^^^ ~ Tree-Prior(aT, /3t 



Figure 1: CTGP hierarchical model with CP, CGP, and TGP elements emphasized. 



The hierarchical model for the CTGP model is straightforward. Given data {X,C), 
we introduce latent variables {Zm}m=i- There is one set for each class or, equivalently, 
one set for each tree. Each of the M — 1 trees divides the space into an independent 
region set of cardinality Rm- And each tree has its own, independent, RTGP prior where 
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the hyperparameters, parameters, and latent variable values — for fixed class index m — are 
generated as in Eq. ([2]). While this full model is identical to the RTGP case, we prefer a 
constant mean for the classification task as the linear mean incorporates extra degrees of 
freedom with less of a clear benefit in modeling flexibility. Finally, the generative model for 
the class labels incorporates the latent variables from all M — 1 trees. The softmax function 
appears just as in the CGP case ([T]). Figured] summarizes the full model via a plate diagram. 
The devil is, of course in the details, which follow. 

3.3.3 Estimation 

To approximate the joint distribution of the M — 1 TGPs, we sample with MCMC as in 
Section [3.1.21 to a large extent. Sampling is accomplished by visiting each tree in turn. For 
the m*^ class, we sequentially draw 6'm,o|rest, 6'm,r |rest for each region r of Rm, 7^|rest, and 
finally the latent variables Z^ ,. I rest for each r. The first three draws are the same as for 
the RTGP, although they may be modified as described below. The latent variable draw 
Zmr|rest is the step unique to the CTGP. 



One possi ble extension to t 
full detail in iGramacy and Lee 



l e tree moves introduced in Section I3.1.2[ and described in 



( 120081 ). is to propose new latent variables along with each 
move. In this formulation, we would accept or reject the new partition along with the newly 
proposed set of latent variables and, possibly, newly proposed GP parameters (c.f. grow). 
A difficulty with these tree moves in classification problems is that the class data C does 
not play a direct role; indeed, it is used only in defining the acceptance probability of the 
latent variable draws (see below). It is tempting to try to incorporate C more directly into 
determining the tree structure by adding latent variable moves to grow, prune, change, and 
swap (but not rotate). We could, for instance, propose Z from its prior. Then the class 
data would appear in the posterior probability factor of the MH acceptance ratio. The tree 
operations themselves are no different whether the latent variables are proposed or not. 
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However, some experimental trials with this modification suggested that the resultant 
mixing in tree space is poor. Intuition seems to corroborate this finding. Incorporating 
many new random variables into a proposal will, generally, decrease the acceptance ratio 
for that proposal. Certainly, when the Z are all proposed simultaneously in each leaf from 
their GP prior, it is unlikely that they will predominantly arrange themselves to be in close 
correspondence with the class labels. Even when a blocking mechanism is used, all of the 
latent variables must be accepted or rejected together — along with the proposed tree move. 
Instead of handicapping the tree moves in this way, we therefore separate out the latent 
variable proposals. Moreover, we suspect that the tree will be most helpful for categorical 
inputs, which are not dealt with in a parsimonious way by the existing CGP model. Indeed, 
the latent variables have enough fiexibility to capture the relationship between the real- valued 
inputs and the class labels in most cases. 

3.3.4 Class latent variables given the class tree and parameters 

While we cannot sample directly from Zm,rkest to obtain a Gibbs sampling draw, a par- 
ticular factorization of the full conditional for some subset of Zm,r suggests a reasonable 
proposal distribution for MH. More specifically, the posterior is proportional to a product 
of two factors: (a) the probability of the class given the latent variables at its predictor(s) 
p(C(x/)|{Zm,r(a^7)}m=i ) ^^^^ (b) the probability of the latent variable(s) given the current 
GP and other latent variables in its region •piZm.rip^i) |-^r, 6', T, Zm,r[p^-i)\ Here J is an index 
set over the predictors x, and as usual r labels the region within a particular class; — / is 
the index set of points in region r of class m that are not in /. The latter distribution is 
multivariate normal, with dimension equal to the size of the latent variable set being pro- 
posed, |/|. To condense notation in Eq. (jll), let Zj = Zm,r{xi) (similarly for F and and 
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let Kip = Km,r{xi,xjr). Then we have 



Zi ~ M\j\{z,d-^) 



0-^ = a'^i^IJ - K,I,^lK.Jj^^iK.^I,l) 



(4) 



= kjj> + T^FjWFj, 



The above is essentially a condensation and slight generalization of the predictive distribution 
for the real- valued output of an RTGP (IHl). To complete the descriptioii , we again slightly 
generalize the previous definitions of and /3 (IGramacy and Led . l2008l ). Here, /?/ and Vj 
are defined by their role in the distribution /3\Xi, Zj,9,T ~ A/p+i(/3/, j). The explicit 
formulas are as follows. 



VrJ = FJKy}Fj + W-'/t^ 
Pi = V^^,iFjKY}Zj + W-'Po/r^) 



We sample from the latent variable distribution with MH steps. Such a step begins with 
a proposal of new latents Z' from the prior for Z conditional on the GP parameters, i.e., 
following Eq. (jlj). Since the prior cancels with the proposal probability in the acceptance 
ratio, the new latent variables Z' are accepted with probability equal to the likelihood ratio, 
where the unprimed Z represents the current latent variable values. This leaves: 

^ ^ exp(-Z;^,.(xj)) ^ Em'=iexp(-Z„,/,,.(xj)) 

It is not necessary to propose all of the latent variables in a region at once. We may 
employ a blocking scheme whereby we propose some subset of the latent variables in region 
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r conditioned on the rest; we keep proposing for different (mutually exclusive) blocks until 
we have iterated over all of the region's latent variables, in each step conditioning on the 
accepted/rejected latents from the previously processed blocks and those yet to be proposed 
(i.e., having the values obtained in the previous iteration of MCMC). There is a natural 
trade-off in block size. Proposing Zm,r all at once leads to a small acceptance ratio and poor 
mixing. But proposing each z individually may result in only small, incremental changes. 
An advantage of the treed partition is that it yields a natural blocking scheme for updating 
the latent variables by making latents in a particular region independent of those in other 
regions. While it may be sensible to block further within a leaf, the automatic blocking 
provided by the treed partition is a step forward from the CGP. 

3.3.5 Prediction 

The most likely class label at a particular predictor value corresponds to the smallest — or 
most negative — of the M latent variables obtained at that predictor; this observation follows 
from the softmax generative model for the class labels given in Eq. ([1]). Recall that the M^^ 
latent is fixed at zero. We predict the class labels by keeping a record of the predicted class 
labels at each round of the Monte Carlo run and then take a majority vote upon completion. 
We may furthermore summarize the posterior multinomial distribution over the labels by 
recording the proportion of times each class label had the lowest latent value over the entire 
Markov chain, obtaining a full accounting of our predictive uncertainty in a true and fully 
Bayesian fashion. 

4 Illustrations and empirical results 

Before illustrating the CTGP algorithm and comparing it to CGP on real and synthetic 
data, we shall illustrate the use of categorical inputs in the regression context. Two examples 
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without categorical inputs then follow in order to illustrate and benchmark CTGP against 
CGP when the inputs are real- valued. We conclude the section with a real data set involving 
credit card applications and exhibiting both types of inputs. One aim there will be to further 
highlight how an appropriate handling of categorical inputs via trees facilitates faster and 
more accurate Bayesian inference. But we also show how this approach aids in identification 
of a particular categorical feature to which the response (credit card approval) is sensitive. 
This last type of output would have been difficult to elicit in the classical CGP setup. 



4.1 Categorical inputs for regression TGP 



Consider the fo 


lowinj 


3.5 of 


Gramacv. 


2007) 



owin g modification of the Friedman data (IFriedman 



1991 



see also Section 
'xi,X2, . . . ,Xio}) 



with one categorical indicator / G {1, 2, 3, 4} that can be encoded in binary as 



(0,0,0,1) 



(0,0,1,0) 



3^(0,1,0,0) 



4 = (1,0,0,0). 



Now let the function that describes the responses (Z), observed with standard normal noise, 
have a mean 



E{Z\x,I) 



10 sin(7rxiX2) 
20(x3 -0.5)2 

10X4 + 



if J = 1 

if J = 2 
if J = 3 



lOxi + 5x2 + 20(x3 - 0.5)2 _^ iQ sin(7rx4X5) if J = 4 



(5) 



that depends on the indicator /. Notice that when 1 = 4 the original Friedman data is 
recovered, but with the first five inputs in reverse order. Irrespective of /, the response 
depends only upon {xi,...,X5}, thus combining nonlinear, linear, and irrelevant effects. 
When / = 3 the response is linear in x. In the experiments that follow we observe Z{x) with 
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i.i.d. A/'(0, 1) noise. 

Note that the encoding above precludes sphts of the indicator variable wit h exactly two 



value s on each side. As has been noted previously in the context of CART (IHastie et al. 



2OOII ). a full enumeration of non-ordinal, categorical predictor splits is exponential in the 
range of categorical predictor values. The encoding above — which generalizes to an arbitrary 
number of predictor values — is more limited; the growth in the number of splits is linear in 
the number of predictor values. Observe that the full and restricted encodings differ only 
for predictor ranges of size at least four. Arbitrary groupings of predictor values are still 
achievable — although the remaining values will not necessarily form a single, opposing group. 

RMSE on Friedman data 
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Figure 2: Summarizing root mean squared error (RMSE) obtained with various regression 
models on 100 repeated experiments with the adjusted Friedman data ([5]), each with a 
random training set of size 500 and a random test set of size 1000. 



Figure [2] compares boxplots and summaries of the root mean squared error (RMSE) 
obtained for various regression algorithms (in the TGP class) when trained on 500 points 
from ([5]) sampled uniformly at random in [0,1]^'^ x {1,2,3,4} and tested on a similarly 
obtained hold-out test set of size 1000. The results and behavior of the various methods are 
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discussed below. 

One naive approach to fitting this data would be to fit a treed GP model ignoring the 
categorical inputs. But this model can only account for the noise, giving high RMSE on a 
hold-out test set, and so is not illustrated here. Clearly, the indicators must be included. 
One simple way to do so would be to posit a Bayesian CART model (bcart in the Figure). 
In this case the indicators are treated as indicators (as is appropriate) but in some sense so 
are the real-valued inputs since only constant models are fit at the leaves of the tree. As 
expected, the tree does indeed partition on the indicators and the other inputs. 

One might hope for a much better fit from a treed linear model (btlm) to the data since 
the response is linear in some of its inputs. Unfortunately, this is not the case. When a linear 
model with an improper prior (the tgp default) is used at the leaves of the tree, the Boolean 
indicators could not be partitioned upon because such a proposal would cause the design 
matrices at the two new leaves to become rank-deficient (they would each, respectively, 
have a column of all zeros or all ones). A treed GP would have the same problem. While 
such models offer a more parsimonious representation of a smoothly varying response (i.e., 
not piecewise constant) compared to CART, a penalty is paid by the inability to partition; 
improvements in modeling the real-valued predictors (via, for instance, the linear model) 
balance against the cost of sacrificing the categorical predictors. Ultimately, the net result 
of these two effects translates into only marginal improvements in predictive performance 
over Bayesian CART. 

One possible remedy is to employ a proper prior at the leaves. But that has its own 
drawbacks. Instead, consider including the Boolean indicators as candidates for treed par- 
titioning, but ignoring them when it comes to fitting the models at the leaves of the tree. 
This way, whenever the Boolean indicators are partitioned upon, the design matrix (for the 
GP or LM) will not contain the corresponding Boolean column and therefore will be of full 
rank. To implement this scheme, and similar ones to follow, we use the basemax argument 
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to the b* functions in tgp as described by 



Gramacy and Taddyl (120091 ) . Consider the treed 



hnear model again (btlm2) under this new treatment. In this case the MAP tree does in- 
deed partition on the indicators in an appropriate way — as well as on some other real- valued 
inputs — and the result is the lower RMSE we desire. 

A more high-powered approach would be to treat all inputs as real-valued by fitting a 
GP at the leaves of the tree (btgp). Binary partitions are allowed on all inputs, X and /. 
As in a direct application of the BGART linear model, treating the Boolean indicators as 
real-valued is implicit in a direct application of the TGP for regression. However, we have 
already seen that this representation is inappropriate since it is known that the process does 
not vary smoothly over the and 1 settings of the four Boolean indicators representing the 
categorical input /. Since the design matrices would become rank-deficient if the Boolean 
indicators were partitioned upon, there is no partitioning on these predictors. As there are 
large covariance matrices to invert, the MCMC inference is very slow. Still, the resulting fit 
(obtained with much patience) is better than the Bayesian CART and treed LM ones, as 
indicated by the RMSE. 

We expect to get the best of both worlds if we allow partitioning on the indicators but 
guard against rank deficient design matrices by ignoring these Boolean indicators for the GP 
models at the leaves (btgp2). Indeed this is the the indicators then become valid 

candidates for partitioning. 

Finally it is known that, after conditioning on the indicators, the data sampled from 
Friedman function are well-modeled with a GP using a stationary covariance structure and 
linear mean. We can use this prior knowledge to build a more parsimonious model. Moreover, 
the MCMC inference for this reduced model would have lower Monte Carlo error, since it 



would bypass attempts to partition on the real-valued inputs. These features are fac i 



by the splitmin argument to the b* functions, as described by 



Gramacv and Taddvl f l2009h . 



itated 



The lower RMSE results for this model (btgp3) corroborate the analysis above. 



23 



4.2 Classification TGP on synthetic data 
4.2.1 Simple step data in Id 

We begin with a trivial classification problem to illustrate the difference between the CTGP 
and CGP models. We consider a generative model that assigns points less than —2/3 to 
class 0, points between —2/3 and 2/3 to class 1, and points greater than 2/3 to class 2. The 
data set predictors are 60 points distributed evenly between —2 and 2, inclusive. This "step" 

step data classes 



I \ I I I 

-2-10 1 2 

X 

Figure 3: Class labels for the step data 

data is depicted in Figure [3l Label the points in order from xq = — 2 to X59 = 2. Clearly, 
we would expect CART and BCART to form a tree with one partition between xig and X20 
and another between X39 and X40, perfectly dividing the classes and correctly predicting all 
points outside the data set that do not fall between these dividing points. 

The CGP model likewise captures the pattern of the data but does so without the par- 
tition mechanism. In particular, running the CGP on this data returns two sets of latent 
variables, one set corresponding to class and the other corresponding to class 1. Note that 
one class label — here class 2 — will always be represented by latent variables that are fixed 
to zero. Otherwise, the model is overdetermined (Section |2]). 

For the first two classes in the CGP model, the latent variable values are depicted in 
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step data latents: class 



Step data latents: class 1 



-10 12 
X 

Figure 4: Latent variables obtained for the step data by the CGP for all non-trivial classes 
(i.e. and 1). Median over MCMC rounds shown as solid blue line; 90% interval shown as 
dashed, red lines. 

Figure ID As expected, the latent variable values, which follow a Gaussian process prior, 
vary smoothly across the predictor space. Since these values enter into a softmax function 
to determine class probabilities in the likelihood, smaller values roughly correspond to a 
higher posterior probability for a particular class. Thus, we see that the latent variables 
for class mark out a negative-valued region corresponding to the training points in class 
0. There is no real distinction, however, amongst the remaining latent variable values for 
this class. Likewise, the latent variable values for class 1 mark out a negative- valued region 
corresponding to the training points in class 1, and the remaining latent variable values — 
representing data in other classes — are all positive and of similar modulus. 

The CTGP model also captures the pattern of the data. But in this very simple example, 
it reduces almost exactly to BCART. Just as for CGP, there are two sets of latent variables — 
one set for class and one set for class 1 (Figure [5]). Unlike the smooth variation of the CGP 
latent variables, the CTGP takes advantage of the treed structure and limiting constant 
model. Thus, the latent variables in class exhibit a sharp divide between xig and X2o- This 
divide sets out a negative region for 0-labeled latent variables and sets the remaining latent 




-2-10 1 2 
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step data latents: class 



Step data latents: class 1 



•2-10 1 2 

X X 

Figure 5: Latent variables obtained for the step data by the CTGP for all non-trivial classes 
(i.e. and 1). Median over MCMC rounds shown as solid blue line; 90% interval shown as 
dashed, red lines. 

values at a constant positive value. Similarly, the class 1 latent variables exhibit a sharp 
divide between 0:39 and 2:40. In contrast to the CGP, the divide in class 1 latent variables 
here separates the data in classes and 1 from the class 2 data and allows the divide in 
class latent variables to take care of separating the class data from classes 1 and 2. This 
difference likely arises from the prior penalty on larger trees. A partition that separates class 
1 data from the other two classes would require a tree of height two; the existing tree of 
height one accomplishes the same task with a more parsimonious representation. 

helght=2, log(p)=-8g.9066 helght=2, log(p)=-93.6361 

x1 <> -0.71 1 864 x1 0.644068 



® (2) ® '-^^ 

2.4792 1.2824 0.4305 1.7912 

20obs 40obs 40obs 20 obs 

Figure 6: Trees from the CTGP for the step data. Left: class 0; Right: class 1 
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Indeed, we can plot the maximum a posteriori trees of height greater than one for both 
classes (Figure |6]) to verify that the divisions here are (a) caused by treed partitioning and 
(b) in exactly the places we would expect given the training data. The figure demonstrates 
that both of these hypotheses are indeed true. 

This example illustrates that the CTGP is capable of partitioning on continuous- valued 
data in practice as well as theory. Nonetheless, this data is rather contrived and simplistic; 
with noisier data, it is much less likely for real-valued partitioning to occur given our current 
prior. And the partitioning does not fundamentally change the predictive accuracy of the 
model. However, it does allow for a much clearer interpretation. The treed partitioning 
in this example has the same value for intuition as BCART trees whereas the CGP latent 
variables, while powerful for modeling, are opaque in meaning. 

Finally, we see even in this small example the running-time benefits of the CTGP over 
the CGP. While the latter took 52.28 seconds to run, the former elapsed only 24.22 seconds, 
a speedup of roughly a factor of two. 

4.2.2 2d Exponential data 

While the CTGP easily partitions categorical inputs, as illustrated with real data in Section 



14. 3[ it can be difficult to manufacture non-trivial synthetic classification data sets that are 
obvious candidates for partitioning amongst the real- valued inputs. In contrast, it can be 
quite easy to manufacture regression data which requires a nonstationary CP model. Here we 
demonstrate how a simple Hessian calculation can convert nonstationary synthetic regression 
data into class labels that are likely to benefi t from partitioni ng. Consider the synthetic 2d 



exponential data used to illustrate RTGP by 
[—2, 6], and the true response is given by 



Gramacyl (120071 ). The input space is [—2,6] x 



z{x) = Xiexp{—xl — xl). (6) 
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In the regression example a small amount of Gaussian noise (with sd = 0.001) is added. To 
convert the real-valued outputs to classification labels we calculate the Hessian: 



H{xi,X2) 



2x1 (2x? - 3)exp( 
2x2(2x1 - l)exp( 



-xf 



-X, 



xl) 2x2(2xf - l)exp( 



X2 ^2 
1 -^2 



2xi{2xl - l)exp( 



Then, for a particular input {xi,X2) we assign a class label based on the sign of the sum of 
the eigenvalues of H{xi,X2), indicating the direction of concavity at that point. A function 
like the 2d exponential one (E]) whose concavity changes more quickly in one region of the 
input space than in another (and is therefore well fit by an RTGP model) will similarly have 
class labels that change more quickly in one region than in another and will similarly require 
a treed process for the best possible fit. Figure [7] shows the resulting class labels. Extensions 



2d Exponential Classes 




2d Exponential Latents 



x1 



2 
xl 



Figure 7: Left: 2d exponential classification data obtained by the Hessian eigen-method; 
Right: Mean latent variables obtained by CTGP 

to multiple class labels (up to m{m— l)/2 of them) can be accommodated by a more general 
treatment of (the signs of) the components of the Hessian. 

Overlaid on the plot in Figure [7] (/e/t) is the maximum a posteriori tree — a single split — 
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encountered in the trans-dimensional Markov chain samphng from the CTGP posterior. Here 
the tree is clearly separating the interesting part of the space, where class labels actually 
vary, from the relatively large area where the class is constant. As in the regression case, by 
partitioning, the CTGP is able to improve upon predictive performance as well as compu- 
tational speed relative to the full CGP model. We trained the classifier(s) on {X, C) data 
obtained by a maximum entropy design of size = 400 subsampled from a dense grid of 
10000 points and calculated the misclassification rate on the remaining 9600 locations. The 
rate was 3.3% for CGP and 1.7% for CTGP, showing a relative improvement of roughly 50%. 
CTGP wins here because the relationship between response (class labels) and predictors is 
clearly nonstationary. That is, no single collection of range parameters {di,d2} is ideal for 
generating the latent Z-values that would best represent the class labels through the soft- 
max. However, it is clear that two sets of parameters, for either side of Xi = 2 as obtained 
by treed partitioning, would suffice. The speed improvements obtained by partitioning were 
even more dramatic. CGP took 21.5 hours to execute 15000 RJ-MCMC rounds whereas 
CTGP took 2.0 hours, an over 10-fold improvement. 



4.3 Classification TGP on real data 

The previous experiment illustrates, among other things, treed partitions on real-valued 
predictors in the classification context. However, a strength of the RTGP demonstrated in 
Section 14.11 is that it can offer a more natural handling of categorical predictors compared to 
a CP. With this observation in mind, we shall compare the CTGP to the CGP (as well as 
neural networks and recursive partitioning) on real data while restricting the treed partitions 
in the CTGP to the categorical inputs. 

For this evaluat ion, we considered the Credit Approval data set from the UCI Machine 



Learning database (lAsuncion and Newman 



20071 ). The set consists of 690 instances grouped 



into two classes: credit card application approval ('+') and application failure ('-')• The 
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names and values of the fifteen predictors for each instance are confidential. However, aspects 
of these attributes relevant to our classification task are available. E.g., we know that 
six inputs are continuous, and nine are categorical. Among the categorical predictors, the 
number of distinct categories ranges from two to fourteen. After binarization, we have a 
data set of 6 continuous and 41 binary predictors. The CGP treats these all as continuous 
attributes. The CTGP forms GPs only over the six continuous attributes and partitions on 
(and only on) the 41 binary attributes. 

We also considered two other standard classification algorithms. First, we used the neural 
networks (NN) algorithm as implemented by the nnet function available in the nnet library in 



R; we set the number o 



documentation (IRipley 



hidden layer units to two in accordance with all examples in the nnet 



20091 ) and otherwise kept the default setting s. Second, we used recur- 



sive p artitioning (RP) as implemented by the rpart function in R (ITherneau and Atkinson 
201ok 



Method 


Avg 


Stdev 


NN 


24.4 


11.4 


RP 


15.4 


3.9 


CGP 


14.6 


4.0 


CTGP 


14.2 


3.6 



Table 1: Average and standard deviation of misclassification rate on the 100 credit-data 
folds for various methods, listed in order of performance from worst to best mean. 



Our comparison consists of ten separate 10-fold cross-validations for a total of 100 folds. 
The results are summarized in Tabled! The CTGP offers a slight improvement over the CGP 
with an average misclassification rate of 14.2%, compared to 14.6%. Similarly, the standard 
deviation of CGP misclassification across folds was 4.0% while the CTGP exhibited slightly 
smaller variability with a standard deviation of 3.6%. More impressive is the speed-up offered 
by CTGP. The average CPU time per fold used by the CGP method was 5.52 hours; with 
an average CPU time per fold of 1.62 hours, the CTGP showed a more than three- fold 
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height=2, log(p)=-936.458 height=3, log(p)=-1 049.65 

x38 <> x38 <> 

® x45 <> 

1.1622 I 

280 obs 
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0.4411 1.9686 0.7287 1.5603 

280 obs 307 obs 289 obs 18 obs 

Figure 8: Trees from CTGP for 2d exponential data; no height one 
improvement. 

Finally, the interpretative aspect of the CTGP, with an appropriate (treed) handling of 
the categorical inputs, is worth highlighting. For a particular run of the algorithm on the 
Credit Analysis data, the MAP trees of different heights are shown in Figure [HI These trees, 
and others, feature principal splits on the 38**^ binary predictor, which is a binarization of 
the 9*^^ two-valued categorical predictor. Therefore, the CTGP indicates, without additional 
work, the significance of this variable in predicting the success of a credit card application. To 
extract similar information from the CGP, one would have to devise and run some additional 
tests — no small feat given the running time of single CGP execution. The figure also shows 
that the Markov chain visited other trees, including one with a split on the 45**^ binary 
predictor. However a comparison of log posteriors (shown in the Figure) and a trace of the 
heights of the trees encountered in the Markov chain (not shown) indicate that this split, 
and any others, had very low posterior probability. 

5 Discussion 

In this paper we have illustrated how many of the benefits of the regression TCP model 
extend to classification. The components of TCP, i.e., treed models and CPs, have separately 
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enjoyed a long tenure of success in application to classification problems. In the case of the 
GP, M — 1 processes are each used as a prior for latent variables, which encode the classes via 
a softmax function. While the CGP is a powerful method that typically offers improvements 
over simpler approaches (including treed models), drawbacks include an implicit assumption 
of stationarity, clumsy handling of categorical inputs, and slow evaluation due to repeated 
large matrix decompositions. In contrast, the treed methods are thrifty, provide a divide- 
and-conquer approach to classification, and can naturally handle categorical inputs. Two 
possible ways of combining the GP with a tree process naturally suggest themselves for, 
first, partitioning up the input space and thus taking advantage of the divide- and-conquer 
approach to a nonstationary prior on the latent variables used for classification and, second, 
handling categorical inputs. 

We argued that it is better to work with M — 1 separate TGP models rather than deal 
with one treed model with M — 1 GPs at each leaf. However, a drawback of this (MTGP) 
approach is that, when there are no (useful) real- valued predictors, then it will behave like 
BCART. But what results will be an inefficient implementation due to all of the latent 
variables involved — as the MCMC basically tries to integrate them out. So in this case, 
a direct implementation via BCART is preferred. However, when mixed categorical and 
real- valued inputs are present, the combined tree and GP approach allows the appropriate 
component of the model (tree in the case of categorical inputs, and GP in the case of real 
ones) to handle the marginal process in each input dimension. The result is a classification 
model that is speedy, interpretable, and highly accurate, combining the strengths of GP and 
treed models for classification. 
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